{
 "metadata": {
  "name": ""
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#Mining the Social Web, 2nd Edition\n",
      "\n",
      "##Chapter 4: Mining Google+: Computing Document Similarity, Extracting Collocations, and More\n",
      "\n",
      "This IPython Notebook provides an interactive way to follow along with and explore the numbered examples from [_Mining the Social Web (2nd Edition)_](http://bit.ly/135dHfs). The intent behind this notebook is to reinforce the concepts from the sample code in a fun, convenient, and effective way. This notebook assumes that you are reading along with the book and have the context of the discussion as you work through these exercises.\n",
      "\n",
      "In the somewhat unlikely event that you've somehow stumbled across this notebook outside of its context on GitHub, [you can find the full source code repository here](http://bit.ly/16kGNyb).\n",
      "\n",
      "## Copyright and Licensing\n",
      "\n",
      "You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use."
     ]
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 1. Searching for a person with the Google+ API"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import httplib2\n",
      "import json\n",
      "import apiclient.discovery # pip install google-api-python-client\n",
      "\n",
      "# XXX: Enter any person's name\n",
      "Q = \"Tim O'Reilly\"\n",
      "\n",
      "# XXX: Enter in your API key from  https://code.google.com/apis/console\n",
      "API_KEY = '' \n",
      "\n",
      "service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(), \n",
      "                                    developerKey=API_KEY)\n",
      "\n",
      "people_feed = service.people().search(query=Q).execute()\n",
      "\n",
      "print json.dumps(people_feed['items'], indent=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 2. Displaying Google+ avatars in IPython Notebook provides a quick way to disambiguate the search results and discover the person you are looking for"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from IPython.core.display import HTML\n",
      "\n",
      "html = []\n",
      "\n",
      "for p in people_feed['items']:\n",
      "    html += ['<p><img src=\"%s\" /> %s: %s</p>' % \\\n",
      "             (p['image']['url'], p['id'], p['displayName'])]\n",
      "\n",
      "HTML(''.join(html))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 3. Fetching recent activities for a particular Google+ user"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import httplib2\n",
      "import json\n",
      "import apiclient.discovery\n",
      "\n",
      "USER_ID = '107033731246200681024' # Tim O'Reilly\n",
      "\n",
      "# XXX: Re-enter your API_KEY from  https://code.google.com/apis/console\n",
      "# if not currently set\n",
      "# API_KEY = ''\n",
      "\n",
      "service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(), \n",
      "                                    developerKey=API_KEY)\n",
      "\n",
      "activity_feed = service.activities().list(\n",
      "  userId=USER_ID,\n",
      "  collection='public',\n",
      "  maxResults='100' # Max allowed per API\n",
      ").execute()\n",
      "\n",
      "print json.dumps(activity_feed, indent=1)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 4. Cleaning HTML in Google+ content by stripping out HTML tags and converting HTML entities back to plain-text representations"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from nltk import clean_html\n",
      "from BeautifulSoup import BeautifulStoneSoup\n",
      "\n",
      "# clean_html removes tags and\n",
      "# BeautifulStoneSoup converts HTML entities\n",
      "\n",
      "def cleanHtml(html):\n",
      "  if html == \"\": return \"\"\n",
      "\n",
      "  return BeautifulStoneSoup(clean_html(html),\n",
      "          convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]\n",
      "\n",
      "print activity_feed['items'][0]['object']['content']\n",
      "print\n",
      "print cleanHtml(activity_feed['items'][0]['object']['content'])"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 5. Looping over multiple pages of Google+ activities and distilling clean text from notes"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import os\n",
      "import httplib2\n",
      "import json\n",
      "import apiclient.discovery\n",
      "from BeautifulSoup import BeautifulStoneSoup\n",
      "from nltk import clean_html\n",
      "\n",
      "USER_ID = '107033731246200681024' # Tim O'Reilly\n",
      "\n",
      "# XXX: Re-enter your API_KEY from  https://code.google.com/apis/console \n",
      "# if not currently set\n",
      "# API_KEY = '' \n",
      "\n",
      "MAX_RESULTS = 200 # Will require multiple requests\n",
      "\n",
      "def cleanHtml(html):\n",
      "  if html == \"\": return \"\"\n",
      "\n",
      "  return BeautifulStoneSoup(clean_html(html),\n",
      "          convertEntities=BeautifulStoneSoup.HTML_ENTITIES).contents[0]\n",
      "\n",
      "service = apiclient.discovery.build('plus', 'v1', http=httplib2.Http(), \n",
      "                                    developerKey=API_KEY)\n",
      "\n",
      "activity_feed = service.activities().list(\n",
      "  userId=USER_ID,\n",
      "  collection='public',\n",
      "  maxResults='100' # Max allowed per request\n",
      ")\n",
      "\n",
      "activity_results = []\n",
      "\n",
      "while activity_feed != None and len(activity_results) < MAX_RESULTS:\n",
      "\n",
      "  activities = activity_feed.execute()\n",
      "\n",
      "  if 'items' in activities:\n",
      "\n",
      "    for activity in activities['items']:\n",
      "\n",
      "      if activity['object']['objectType'] == 'note' and \\\n",
      "         activity['object']['content'] != '':\n",
      "\n",
      "        activity['title'] = cleanHtml(activity['title'])\n",
      "        activity['object']['content'] = cleanHtml(activity['object']['content'])\n",
      "        activity_results += [activity]\n",
      "\n",
      "  # list_next requires the previous request and response objects\n",
      "  activity_feed = service.activities().list_next(activity_feed, activities)\n",
      "\n",
      "# Write the output to a file for convenience\n",
      "\n",
      "f = open(os.path.join('resources', 'ch04-googleplus', USER_ID + '.json'), 'w')\n",
      "f.write(json.dumps(activity_results, indent=1))\n",
      "f.close()\n",
      "\n",
      "print str(len(activity_results)), \"activities written to\", f.name"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 6. Sample data structures used in illustrations for the rest of this chapter"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "corpus = { \n",
      " 'a' : \"Mr. Green killed Colonel Mustard in the study with the candlestick. \\\n",
      "Mr. Green is not a very nice fellow.\",\n",
      " 'b' : \"Professor Plum has a green plant in his study.\",\n",
      " 'c' : \"Miss Scarlett watered Professor Plum's green plant while he was away \\\n",
      "from his office last week.\"\n",
      "}\n",
      "terms = {\n",
      " 'a' : [ i.lower() for i in corpus['a'].split() ],\n",
      " 'b' : [ i.lower() for i in corpus['b'].split() ],\n",
      " 'c' : [ i.lower() for i in corpus['c'].split() ]\n",
      " }"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 7. Running TF-IDF on sample data"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "from math import log\n",
      "\n",
      "# XXX: Enter in a query term from the corpus variable\n",
      "QUERY_TERMS = ['mr.', 'green']\n",
      "\n",
      "def tf(term, doc, normalize=True):\n",
      "    doc = doc.lower().split()\n",
      "    if normalize:\n",
      "        return doc.count(term.lower()) / float(len(doc))\n",
      "    else:\n",
      "        return doc.count(term.lower()) / 1.0\n",
      "\n",
      "\n",
      "def idf(term, corpus):\n",
      "    num_texts_with_term = len([True for text in corpus if term.lower()\n",
      "                              in text.lower().split()])\n",
      "\n",
      "    # tf-idf calc involves multiplying against a tf value less than 0, so it's\n",
      "    # necessary to return a value greater than 1 for consistent scoring. \n",
      "    # (Multiplying two values less than 1 returns a value less than each of \n",
      "    # them.)\n",
      "\n",
      "    try:\n",
      "        return 1.0 + log(float(len(corpus)) / num_texts_with_term)\n",
      "    except ZeroDivisionError:\n",
      "        return 1.0\n",
      "\n",
      "\n",
      "def tf_idf(term, doc, corpus):\n",
      "    return tf(term, doc) * idf(term, corpus)\n",
      "\n",
      "\n",
      "corpus = \\\n",
      "    {'a': 'Mr. Green killed Colonel Mustard in the study with the candlestick. \\\n",
      "Mr. Green is not a very nice fellow.',\n",
      "     'b': 'Professor Plum has a green plant in his study.',\n",
      "     'c': \"Miss Scarlett watered Professor Plum's green plant while he was away \\\n",
      "from his office last week.\"}\n",
      "\n",
      "for (k, v) in sorted(corpus.items()):\n",
      "    print k, ':', v\n",
      "print\n",
      "    \n",
      "# Score queries by calculating cumulative tf_idf score for each term in query\n",
      "\n",
      "query_scores = {'a': 0, 'b': 0, 'c': 0}\n",
      "for term in [t.lower() for t in QUERY_TERMS]:\n",
      "    for doc in sorted(corpus):\n",
      "        print 'TF(%s): %s' % (doc, term), tf(term, corpus[doc])\n",
      "    print 'IDF: %s' % (term, ), idf(term, corpus.values())\n",
      "    print\n",
      "\n",
      "    for doc in sorted(corpus):\n",
      "        score = tf_idf(term, corpus[doc], corpus.values())\n",
      "        print 'TF-IDF(%s): %s' % (doc, term), score\n",
      "        query_scores[doc] += score\n",
      "    print\n",
      "\n",
      "print \"Overall TF-IDF scores for query '%s'\" % (' '.join(QUERY_TERMS), )\n",
      "for (doc, score) in sorted(query_scores.items()):\n",
      "    print doc, score"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 8. Exploring Google+ data with NLTK"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Explore some of NLTK's functionality by exploring the data. \n",
      "# Here are some suggestions for an interactive interpreter session.\n",
      "\n",
      "import nltk\n",
      "\n",
      "# Download ancillary nltk packages if not already installed\n",
      "nltk.download('stopwords')\n",
      "\n",
      "all_content = \" \".join([ a['object']['content'] for a in activity_results ])\n",
      "\n",
      "# Approximate bytes of text\n",
      "print len(all_content)\n",
      "\n",
      "tokens = all_content.split()\n",
      "text = nltk.Text(tokens)\n",
      "\n",
      "# Examples of the appearance of the word \"open\"\n",
      "text.concordance(\"open\")\n",
      "\n",
      "# Frequent collocations in the text (usually meaningful phrases)\n",
      "text.collocations()\n",
      "\n",
      "# Frequency analysis for words of interest\n",
      "fdist = text.vocab()\n",
      "fdist[\"open\"]\n",
      "fdist[\"source\"]\n",
      "fdist[\"web\"]\n",
      "fdist[\"2.0\"]\n",
      "\n",
      "# Number of words in the text\n",
      "len(tokens)\n",
      "\n",
      "# Number of unique words in the text\n",
      "\n",
      "len(fdist.keys())\n",
      "\n",
      "# Common words that aren't stopwords\n",
      "[w for w in fdist.keys()[:100] \\\n",
      "   if w.lower() not in nltk.corpus.stopwords.words('english')]\n",
      "\n",
      "# Long words that aren't URLs\n",
      "[w for w in fdist.keys() if len(w) > 15 and not w.startswith(\"http\")]\n",
      "\n",
      "# Number of URLs\n",
      "len([w for w in fdist.keys() if w.startswith(\"http\")])\n",
      "\n",
      "# Enumerate the frequency distribution\n",
      "for rank, word in enumerate(fdist): \n",
      "    print rank, word, fdist[word]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 9. Querying Google+ data with TF-IDF"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "import nltk\n",
      "\n",
      "# Load in human language data from wherever you've saved it\n",
      "\n",
      "DATA = 'resources/ch04-googleplus/107033731246200681024.json'\n",
      "data = json.loads(open(DATA).read())\n",
      "\n",
      "# XXX: Provide your own query terms here\n",
      "\n",
      "QUERY_TERMS = ['SOPA']\n",
      "\n",
      "activities = [activity['object']['content'].lower().split() \\\n",
      "              for activity in data \\\n",
      "                if activity['object']['content'] != \"\"]\n",
      "\n",
      "# TextCollection provides tf, idf, and tf_idf abstractions so \n",
      "# that we don't have to maintain/compute them ourselves\n",
      "\n",
      "tc = nltk.TextCollection(activities)\n",
      "\n",
      "relevant_activities = []\n",
      "\n",
      "for idx in range(len(activities)):\n",
      "    score = 0\n",
      "    for term in [t.lower() for t in QUERY_TERMS]:\n",
      "        score += tc.tf_idf(term, activities[idx])\n",
      "    if score > 0:\n",
      "        relevant_activities.append({'score': score, 'title': data[idx]['title'],\n",
      "                              'url': data[idx]['url']})\n",
      "\n",
      "# Sort by score and display results\n",
      "\n",
      "relevant_activities = sorted(relevant_activities, \n",
      "                             key=lambda p: p['score'], reverse=True)\n",
      "for activity in relevant_activities:\n",
      "    print activity['title']\n",
      "    print '\\tLink: %s' % (activity['url'], )\n",
      "    print '\\tScore: %s' % (activity['score'], )\n",
      "    print"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 10. Finding similar documents using cosine similarity"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "import nltk\n",
      "\n",
      "# Load in human language data from wherever you've saved it\n",
      "\n",
      "DATA = 'resources/ch04-googleplus/107033731246200681024.json'\n",
      "data = json.loads(open(DATA).read())\n",
      "\n",
      "# Only consider content that's ~1000+ words.\n",
      "data = [ post for post in json.loads(open(DATA).read())\n",
      "         if len(post['object']['content']) > 1000 ]\n",
      "\n",
      "all_posts = [post['object']['content'].lower().split() \n",
      "             for post in data ]\n",
      "\n",
      "\n",
      "# Provides tf, idf, and tf_idf abstractions for scoring\n",
      "\n",
      "tc = nltk.TextCollection(all_posts)\n",
      "\n",
      "# Compute a term-document matrix such that td_matrix[doc_title][term]\n",
      "# returns a tf-idf score for the term in the document\n",
      "\n",
      "td_matrix = {}\n",
      "for idx in range(len(all_posts)):\n",
      "    post = all_posts[idx]\n",
      "    fdist = nltk.FreqDist(post)\n",
      "\n",
      "    doc_title = data[idx]['title']\n",
      "    url = data[idx]['url']\n",
      "    td_matrix[(doc_title, url)] = {}\n",
      "\n",
      "    for term in fdist.iterkeys():\n",
      "        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)\n",
      "        \n",
      "# Build vectors such that term scores are in the same positions...\n",
      "\n",
      "distances = {}\n",
      "for (title1, url1) in td_matrix.keys():\n",
      "\n",
      "    distances[(title1, url1)] = {}\n",
      "    (min_dist, most_similar) = (1.0, ('', ''))\n",
      "\n",
      "    for (title2, url2) in td_matrix.keys():\n",
      "\n",
      "        # Take care not to mutate the original data structures\n",
      "        # since we're in a loop and need the originals multiple times\n",
      "\n",
      "        terms1 = td_matrix[(title1, url1)].copy()\n",
      "        terms2 = td_matrix[(title2, url2)].copy()\n",
      "\n",
      "        # Fill in \"gaps\" in each map so vectors of the same length can be computed\n",
      "\n",
      "        for term1 in terms1:\n",
      "            if term1 not in terms2:\n",
      "                terms2[term1] = 0\n",
      "\n",
      "        for term2 in terms2:\n",
      "            if term2 not in terms1:\n",
      "                terms1[term2] = 0\n",
      "\n",
      "        # Create vectors from term maps\n",
      "\n",
      "        v1 = [score for (term, score) in sorted(terms1.items())]\n",
      "        v2 = [score for (term, score) in sorted(terms2.items())]\n",
      "\n",
      "        # Compute similarity amongst documents\n",
      "\n",
      "        distances[(title1, url1)][(title2, url2)] = \\\n",
      "            nltk.cluster.util.cosine_distance(v1, v2)\n",
      "\n",
      "        if url1 == url2:\n",
      "            #print distances[(title1, url1)][(title2, url2)]\n",
      "            continue\n",
      "\n",
      "        if distances[(title1, url1)][(title2, url2)] < min_dist:\n",
      "            (min_dist, most_similar) = (distances[(title1, url1)][(title2,\n",
      "                                         url2)], (title2, url2))\n",
      "    \n",
      "    print '''Most similar to %s (%s)\n",
      "\\t%s (%s)\n",
      "\\tscore %f\n",
      "''' % (title1, url1,\n",
      "            most_similar[0], most_similar[1], 1-min_dist)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "**Code to create a matrix diagram displaying linkages between Google+ activities as illustrated in Figure 6.**"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "from operator import itemgetter\n",
      "import nltk\n",
      "from IPython.display import IFrame\n",
      "from IPython.core.display import display\n",
      "\n",
      "# Load in human language data from wherever you've saved it\n",
      "\n",
      "DATA = 'resources/ch04-googleplus/107033731246200681024.json'\n",
      "\n",
      "# Only consider content that's ~100+ words.\n",
      "data = [post for post in json.loads(open(DATA).read())\n",
      "        if len(post['object']['content']) > 1000]\n",
      "\n",
      "\n",
      "all_posts = [post['object']['content'].lower().split() \n",
      "             for post in data]\n",
      "\n",
      "# Provides tf, idf, tf_idf abstractions for scoring\n",
      "\n",
      "tc = nltk.TextCollection(all_posts)\n",
      "\n",
      "# Compute a term-document matrix such that td_matrix[doc_title][term]\n",
      "# returns a tf-idf score for the term in the document\n",
      "\n",
      "td_matrix = {}\n",
      "for idx in range(len(all_posts)):\n",
      "    post = all_posts[idx]\n",
      "    fdist = nltk.FreqDist(post)\n",
      "\n",
      "    doc_title = data[idx]['title']\n",
      "    url = data[idx]['url']\n",
      "    td_matrix[(doc_title, url)] = {}\n",
      "\n",
      "    for term in fdist.iterkeys():\n",
      "        td_matrix[(doc_title, url)][term] = tc.tf_idf(term, post)\n",
      "\n",
      "# Build vectors such that term scores are in the same positions...\n",
      "\n",
      "distances = {}\n",
      "\n",
      "# Visualization output requires a list of nodes with values and a list of links that have\n",
      "# source and destination targets. We'll pre-build the list of nodes here and create an index\n",
      "# so that we can easily create links from titles after we compute the most similar items\n",
      "# on each iteration of the outer loop\n",
      "\n",
      "viz_links = []\n",
      "viz_nodes = [ {'title' : title, 'url' : url} for (title, url) in td_matrix.keys() ]\n",
      "\n",
      "foo = 0\n",
      "for vn in viz_nodes:\n",
      "    vn.update({'idx' : foo})\n",
      "    foo += 1\n",
      "\n",
      "idx = dict(zip([ vn['title'] for vn in viz_nodes ], range(len(viz_nodes))))\n",
      "\n",
      "\n",
      "for (title1, url1) in td_matrix.keys():\n",
      "\n",
      "    distances[(title1, url1)] = {}\n",
      "    (min_dist, most_similar) = (1.0, ('', ''))\n",
      "\n",
      "    for (title2, url2) in td_matrix.keys():\n",
      "\n",
      "        # Take care not to mutate the original data structures\n",
      "        # since we're in a loop and need the originals multiple times\n",
      "\n",
      "        terms1 = td_matrix[(title1, url1)].copy()\n",
      "        terms2 = td_matrix[(title2, url2)].copy()\n",
      "\n",
      "        # Fill in \"gaps\" in each map so vectors of the same length can be computed\n",
      "\n",
      "        for term1 in terms1:\n",
      "            if term1 not in terms2:\n",
      "                terms2[term1] = 0\n",
      "\n",
      "        for term2 in terms2:\n",
      "            if term2 not in terms1:\n",
      "                terms1[term2] = 0\n",
      "\n",
      "        # Create vectors from term maps\n",
      "\n",
      "        v1 = [score for (term, score) in sorted(terms1.items())]\n",
      "        v2 = [score for (term, score) in sorted(terms2.items())]\n",
      "\n",
      "        # Compute similarity amongst documents\n",
      "\n",
      "        distances[(title1, url1)][(title2, url2)] = \\\n",
      "            nltk.cluster.util.cosine_distance(v1, v2)\n",
      "\n",
      "        if url1 == url2:\n",
      "            #print distances[(title1, url1)][(title2, url2)]\n",
      "            continue\n",
      "\n",
      "        if distances[(title1, url1)][(title2, url2)] < min_dist:\n",
      "            (min_dist, most_similar) = (distances[(title1, url1)][(title2,\n",
      "                                         url2)], (title2, url2))\n",
      "    \n",
      "    viz_links.append({'source' : idx[title1], 'target' : idx[most_similar[0]], 'score' : 1 - min_dist})\n",
      "    \n",
      "\n",
      "f = open('resources/ch04-googleplus/viz/matrix.json', 'w')\n",
      "f.write(json.dumps({'nodes' : viz_nodes, 'links' : viz_links}, indent=1))\n",
      "f.close()\n",
      "\n",
      "# Display the visualization below with an inline frame\n",
      "display(IFrame('files/resources/ch04-googleplus/viz/matrix.html', '100%', '600px'))\n",
      "\n",
      "# You could also serve it by running SimpleHTTPServer in the viz directory as follows:\n",
      "# $ python -m SimpleHTTPServer 9000\n",
      "# Now, open http://localhost:9000/matrix.html in your web browser"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 11. Using NLTK to compute bigrams and collocations for a sentence"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import nltk\n",
      "\n",
      "sentence = \"Mr. Green killed Colonel Mustard in the study with the \" + \\\n",
      "           \"candlestick. Mr. Green is not a very nice fellow.\"\n",
      "\n",
      "print nltk.ngrams(sentence.split(), 2)\n",
      "txt = nltk.Text(sentence.split())\n",
      "\n",
      "txt.collocations()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    },
    {
     "cell_type": "heading",
     "level": 2,
     "metadata": {},
     "source": [
      "Example 12. Using NLTK to compute collocations in a similar manner to the nltk.Text.collocations demo functionality"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import json\n",
      "import nltk\n",
      "\n",
      "# Load in human language data from wherever you've saved it\n",
      "\n",
      "DATA = 'resources/ch04-googleplus/107033731246200681024.json'\n",
      "data = json.loads(open(DATA).read())\n",
      "\n",
      "# Number of collocations to find\n",
      "\n",
      "N = 25\n",
      "\n",
      "all_tokens = [token for activity in data for token in activity['object']['content'\n",
      "              ].lower().split()]\n",
      "\n",
      "finder = nltk.BigramCollocationFinder.from_words(all_tokens)\n",
      "finder.apply_freq_filter(2)\n",
      "finder.apply_word_filter(lambda w: w in nltk.corpus.stopwords.words('english'))\n",
      "scorer = nltk.metrics.BigramAssocMeasures.jaccard\n",
      "collocations = finder.nbest(scorer, N)\n",
      "\n",
      "for collocation in collocations:\n",
      "    c = ' '.join(collocation)\n",
      "    print c"
     ],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}